Information about the Data Set

This data set logs every shot taken during the 2014-2015 NBA regular season from October 2014 to March 2015 (note that this does not span the entire season), along with a variety of factors relevant to each shot. We obtained the data from https://www.kaggle.com/dansbecker/nba-shot-logs; it was scraped from the NBA’s REST API.

library(tidyverse)  # provides read_csv() and the dplyr/tidyr/stringr/ggplot2 functions used below
shots <- read_csv("shot_logs.csv")
## Rows: 128069 Columns: 21
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr   (6): MATCHUP, LOCATION, W, SHOT_RESULT, CLOSEST_DEFENDER, player_name
## dbl  (14): GAME_ID, FINAL_MARGIN, SHOT_NUMBER, PERIOD, SHOT_CLOCK, DRIBBLES,...
## time  (1): GAME_CLOCK
## 
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.

A Brief Exploration of the Data Set

To get a feel for the data set, we use the summary(), str(), and head() functions to inspect its variables and observations.

summary(shots)
##     GAME_ID           MATCHUP            LOCATION              W            
##  Min.   :21400001   Length:128069      Length:128069      Length:128069     
##  1st Qu.:21400233   Class :character   Class :character   Class :character  
##  Median :21400449   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :21400452                                                           
##  3rd Qu.:21400673                                                           
##  Max.   :21400908                                                           
##                                                                             
##   FINAL_MARGIN       SHOT_NUMBER         PERIOD       GAME_CLOCK      
##  Min.   :-53.0000   Min.   : 1.000   Min.   :1.000   Length:128069    
##  1st Qu.: -8.0000   1st Qu.: 3.000   1st Qu.:1.000   Class1:hms       
##  Median :  1.0000   Median : 5.000   Median :2.000   Class2:difftime  
##  Mean   :  0.2087   Mean   : 6.507   Mean   :2.469   Mode  :numeric   
##  3rd Qu.:  9.0000   3rd Qu.: 9.000   3rd Qu.:3.000                    
##  Max.   : 53.0000   Max.   :38.000   Max.   :7.000                    
##                                                                       
##    SHOT_CLOCK       DRIBBLES        TOUCH_TIME         SHOT_DIST    
##  Min.   : 0.00   Min.   : 0.000   Min.   :-163.600   Min.   : 0.00  
##  1st Qu.: 8.20   1st Qu.: 0.000   1st Qu.:   0.900   1st Qu.: 4.70  
##  Median :12.30   Median : 1.000   Median :   1.600   Median :13.70  
##  Mean   :12.45   Mean   : 2.023   Mean   :   2.766   Mean   :13.57  
##  3rd Qu.:16.68   3rd Qu.: 2.000   3rd Qu.:   3.700   3rd Qu.:22.50  
##  Max.   :24.00   Max.   :32.000   Max.   :  24.900   Max.   :47.20  
##  NA's   :5567                                                       
##     PTS_TYPE     SHOT_RESULT        CLOSEST_DEFENDER  
##  Min.   :2.000   Length:128069      Length:128069     
##  1st Qu.:2.000   Class :character   Class :character  
##  Median :2.000   Mode  :character   Mode  :character  
##  Mean   :2.265                                        
##  3rd Qu.:3.000                                        
##  Max.   :3.000                                        
##                                                       
##  CLOSEST_DEFENDER_PLAYER_ID CLOSE_DEF_DIST        FGM              PTS        
##  Min.   :   708             Min.   : 0.000   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:101249             1st Qu.: 2.300   1st Qu.:0.0000   1st Qu.:0.0000  
##  Median :201949             Median : 3.700   Median :0.0000   Median :0.0000  
##  Mean   :159039             Mean   : 4.123   Mean   :0.4521   Mean   :0.9973  
##  3rd Qu.:203079             3rd Qu.: 5.300   3rd Qu.:1.0000   3rd Qu.:2.0000  
##  Max.   :530027             Max.   :53.200   Max.   :1.0000   Max.   :3.0000  
##                                                                               
##  player_name          player_id     
##  Length:128069      Min.   :   708  
##  Class :character   1st Qu.:101162  
##  Mode  :character   Median :201939  
##                     Mean   :157238  
##                     3rd Qu.:202704  
##                     Max.   :204060  
## 
str(shots)
## spec_tbl_df [128,069 x 21] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ GAME_ID                   : num [1:128069] 21400899 21400899 21400899 21400899 21400899 ...
##  $ MATCHUP                   : chr [1:128069] "MAR 04, 2015 - CHA @ BKN" "MAR 04, 2015 - CHA @ BKN" "MAR 04, 2015 - CHA @ BKN" "MAR 04, 2015 - CHA @ BKN" ...
##  $ LOCATION                  : chr [1:128069] "A" "A" "A" "A" ...
##  $ W                         : chr [1:128069] "W" "W" "W" "W" ...
##  $ FINAL_MARGIN              : num [1:128069] 24 24 24 24 24 24 24 24 24 1 ...
##  $ SHOT_NUMBER               : num [1:128069] 1 2 3 4 5 6 7 8 9 1 ...
##  $ PERIOD                    : num [1:128069] 1 1 1 2 2 2 4 4 4 2 ...
##  $ GAME_CLOCK                : 'hms' num [1:128069] 01:09:00 00:14:00 00:00:00 11:47:00 ...
##   ..- attr(*, "units")= chr "secs"
##  $ SHOT_CLOCK                : num [1:128069] 10.8 3.4 NA 10.3 10.9 9.1 14.5 3.4 12.4 17.4 ...
##  $ DRIBBLES                  : num [1:128069] 2 0 3 2 2 2 11 3 0 0 ...
##  $ TOUCH_TIME                : num [1:128069] 1.9 0.8 2.7 1.9 2.7 4.4 9 2.5 0.8 1.1 ...
##  $ SHOT_DIST                 : num [1:128069] 7.7 28.2 10.1 17.2 3.7 18.4 20.7 3.5 24.6 22.4 ...
##  $ PTS_TYPE                  : num [1:128069] 2 3 2 2 2 2 2 2 3 3 ...
##  $ SHOT_RESULT               : chr [1:128069] "made" "missed" "missed" "missed" ...
##  $ CLOSEST_DEFENDER          : chr [1:128069] "Anderson, Alan" "Bogdanovic, Bojan" "Bogdanovic, Bojan" "Brown, Markel" ...
##  $ CLOSEST_DEFENDER_PLAYER_ID: num [1:128069] 101187 202711 202711 203900 201152 ...
##  $ CLOSE_DEF_DIST            : num [1:128069] 1.3 6.1 0.9 3.4 1.1 2.6 6.1 2.1 7.3 19.8 ...
##  $ FGM                       : num [1:128069] 1 0 0 0 0 0 0 1 0 0 ...
##  $ PTS                       : num [1:128069] 2 0 0 0 0 0 0 2 0 0 ...
##  $ player_name               : chr [1:128069] "brian roberts" "brian roberts" "brian roberts" "brian roberts" ...
##  $ player_id                 : num [1:128069] 203148 203148 203148 203148 203148 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   GAME_ID = col_double(),
##   ..   MATCHUP = col_character(),
##   ..   LOCATION = col_character(),
##   ..   W = col_character(),
##   ..   FINAL_MARGIN = col_double(),
##   ..   SHOT_NUMBER = col_double(),
##   ..   PERIOD = col_double(),
##   ..   GAME_CLOCK = col_time(format = ""),
##   ..   SHOT_CLOCK = col_double(),
##   ..   DRIBBLES = col_double(),
##   ..   TOUCH_TIME = col_double(),
##   ..   SHOT_DIST = col_double(),
##   ..   PTS_TYPE = col_double(),
##   ..   SHOT_RESULT = col_character(),
##   ..   CLOSEST_DEFENDER = col_character(),
##   ..   CLOSEST_DEFENDER_PLAYER_ID = col_double(),
##   ..   CLOSE_DEF_DIST = col_double(),
##   ..   FGM = col_double(),
##   ..   PTS = col_double(),
##   ..   player_name = col_character(),
##   ..   player_id = col_double()
##   .. )
##  - attr(*, "problems")=<externalptr>
head(shots)
## # A tibble: 6 x 21
##    GAME_ID MATCHUP     LOCATION W     FINAL_MARGIN SHOT_NUMBER PERIOD GAME_CLOCK
##      <dbl> <chr>       <chr>    <chr>        <dbl>       <dbl>  <dbl> <time>    
## 1 21400899 MAR 04, 20~ A        W               24           1      1 01:09     
## 2 21400899 MAR 04, 20~ A        W               24           2      1 00:14     
## 3 21400899 MAR 04, 20~ A        W               24           3      1 00:00     
## 4 21400899 MAR 04, 20~ A        W               24           4      2 11:47     
## 5 21400899 MAR 04, 20~ A        W               24           5      2 10:34     
## 6 21400899 MAR 04, 20~ A        W               24           6      2 08:15     
## # ... with 13 more variables: SHOT_CLOCK <dbl>, DRIBBLES <dbl>,
## #   TOUCH_TIME <dbl>, SHOT_DIST <dbl>, PTS_TYPE <dbl>, SHOT_RESULT <chr>,
## #   CLOSEST_DEFENDER <chr>, CLOSEST_DEFENDER_PLAYER_ID <dbl>,
## #   CLOSE_DEF_DIST <dbl>, FGM <dbl>, PTS <dbl>, player_name <chr>,
## #   player_id <dbl>

After briefly exploring this data set, we see that it has 128069 observations of 21 variables. See the data dictionary below for what each variable represents. Most columns are stored as double or character, and GAME_CLOCK is stored as a time. The data set also mixes numerical variables, such as the shot distance, with categorical variables, such as whether the game was at home or away.
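
The summary() output above also hints at two data-quality issues worth checking before any analysis: SHOT_CLOCK has 5567 missing values, and the minimum TOUCH_TIME is negative (-163.6). As a minimal sketch of such a check, the code below uses a small hand-built data frame instead of the full file. The first four rows copy the SHOT_CLOCK and TOUCH_TIME values from the str() output; the fifth row carries the negative minimum, paired with an assumed shot-clock value. The names toy and flag_suspect_rows are illustrative only.

```r
# Toy rows: the first four values come from str(shots); the last row carries
# the negative TOUCH_TIME minimum from summary(shots). The SHOT_CLOCK value
# paired with that last row (5.0) is an assumption for illustration.
toy <- data.frame(
  SHOT_CLOCK = c(10.8, 3.4, NA, 10.3, 5.0),
  TOUCH_TIME = c(1.9, 0.8, 2.7, 1.9, -163.6)
)

# Flag rows with a missing shot clock or an impossible (negative) touch time
flag_suspect_rows <- function(df) {
  data.frame(
    missing_shot_clock  = is.na(df$SHOT_CLOCK),
    negative_touch_time = df$TOUCH_TIME < 0
  )
}

flags <- flag_suspect_rows(toy)
colSums(flags)  # number of affected rows per check
```

Calling flag_suspect_rows(shots) on the full data would give the same two logical columns for every shot, which could then be used to decide whether to drop or keep those rows.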

Shot Log Data Dictionary

This dictionary describes what each variable of the data set represents.

GAME_ID

Identification number of a particular NBA game.

MATCHUP

The date the game occurred as well as the two teams that faced off.

LOCATION

Whether the team was playing at home (H) or away (A).

W

Whether the team won (W) or lost (L).

FINAL_MARGIN

Final margin of victory or defeat for that game.

SHOT_NUMBER

The sequence number of the shot among the player’s attempts in that game.

PERIOD

The period in which the shot was attempted.

GAME_CLOCK

The game clock at the time the shot was attempted.

SHOT_CLOCK

The shot clock at the time the shot was attempted.

DRIBBLES

The number of dribbles prior to the shot attempted.

TOUCH_TIME

How long the player was holding the ball before shooting.

SHOT_DIST

How far away, in feet, the player was when shooting the ball.

PTS_TYPE

What kind of shot it was, either a 2 pointer or 3 pointer.

SHOT_RESULT

Whether the shot was made or missed.

CLOSEST_DEFENDER

Who the closest defender was when the shot was taken.

CLOSEST_DEFENDER_PLAYER_ID

Identification (number) assigned to the closest defender when the shot was taken.

CLOSE_DEF_DIST

Distance of the closest defender to the person who took that shot (in feet).

FGM

Field goal made: 1 if the shot was made, 0 if it was missed.

PTS

Number of points awarded for the shot.

player_name

Name of the player who took the shot.

player_id

Identification (number) assigned to the player who took the shot.
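
One caveat on GAME_CLOCK: the values look like minutes:seconds within a period (for example 11:47 in a 12-minute quarter), yet the str() output shows that read_csv() stored them as hours:minutes:seconds (01:09:00, 11:47:00). Assuming that reading is right, dividing the stored number of seconds by 60 recovers the seconds left in the period. The sketch below rebuilds three values from the str() output as a base R difftime (the hms class is built on difftime) rather than loading the file; stored and seconds_left are illustrative names.

```r
# Values as read_csv() stored them: "01:09:00", "00:14:00", "11:47:00"
# (hours:minutes:seconds), expressed here in seconds
stored <- as.difftime(c(1 * 3600 + 9 * 60, 14 * 60, 11 * 3600 + 47 * 60),
                      units = "secs")

# Assumption: the file meant minutes:seconds, so every value is 60x too large
seconds_left <- as.numeric(stored, units = "secs") / 60
seconds_left  # 69 14 707
```

On the real column, as.numeric(shots$GAME_CLOCK) / 60 should give the same kind of result, since the str() output reports the column's units as seconds.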

Exploring the Different Shots Taken at Different Times

The research questions we want to answer are what kinds of shots are taken (in terms of distance from the basket) and how often they are made at different times on the shot clock. In particular, we will analyze two players from the Golden State Warriors, Stephen Curry and Draymond Green.

The first part of cleaning this data set is to separate the MATCHUP column into two columns. The original column holds both the date and the two teams playing, so it makes sense to create a separate DATE column for the date. We then convert the DATE column to a date type using lubridate. Finally, we store the first three characters of the remaining MATCHUP string, the shooting player’s team abbreviation, in a new TEAM column.

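library(lubridate)  # provides mdy()

shots <- shots %>%
  # Split "MAR 04, 2015 - CHA @ BKN" into a date part and a teams part
  separate(MATCHUP, into = c("DATE", "MATCHUP"), sep = " - ") %>%
  # Parse the date string and keep the first team abbreviation as TEAM
  mutate(DATE = mdy(DATE),
         TEAM = str_sub(MATCHUP, 1, 3))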
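
To make these steps concrete, here is a sketch that performs the same split on the single MATCHUP string shown in the str() output, using only base R so that it runs without the tidyverse. It is a stand-in for separate(), mdy(), and str_sub(), not a replacement for them. The variable names are illustrative, and the fixed character positions assume the "MON DD, YYYY" layout seen in the data.

```r
matchup <- "MAR 04, 2015 - CHA @ BKN"  # value taken from the str(shots) output

# Step 1: split into the date part and the teams part (what separate() does)
parts    <- strsplit(matchup, " - ", fixed = TRUE)[[1]]
date_str <- parts[1]
teams    <- parts[2]

# Step 2: build a Date from the pieces (what mdy() does); month.abb holds the
# English month abbreviations regardless of the session locale
month_num <- match(substr(date_str, 1, 3), toupper(month.abb))
day_num   <- as.integer(substr(date_str, 5, 6))
year_num  <- as.integer(substr(date_str, 9, 12))
game_date <- as.Date(sprintf("%04d-%02d-%02d", year_num, month_num, day_num))

# Step 3: the first three characters are the team abbreviation (str_sub())
team <- substr(teams, 1, 3)

game_date  # "2015-03-04"
team       # "CHA"
```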

Another cleaning step is to reformat the player_name variable so that it is consistent with the CLOSEST_DEFENDER column, which is formatted as “Last name, First name”. Because player_name is formatted as “first last”, we convert it to title case, split it into its parts, reverse their order, and join them with a comma. The rowwise() function lets str_c() collapse the name parts one row at a time rather than across the whole column.

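# Title-case the lower-case names, split them into words, and reverse the word order
shots$player_name <- str_to_title(shots$player_name)
shots$player_name <- str_split(shots$player_name, pattern = " ")
shots$player_name <- lapply(X = shots$player_name, FUN = rev)

# Join each row's reversed words with ", ", then drop the rowwise grouping
shots <- shots %>%
  rowwise() %>%
  mutate(player_name = str_c(player_name, collapse = ", ")) %>%
  ungroup()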
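
As an alternative sketch, the same “first last” to “Last, First” conversion can be written as one vectorized function in base R. The function name to_last_first is hypothetical, and the regex-based title-casing is only a stand-in for str_to_title(). One behavioral difference is worth noting: the split-and-reverse approach above places a comma between every word of a name with more than two words, while this version moves only the first word behind the rest.

```r
# Hypothetical helper: "first last" -> "Last, First"
to_last_first <- function(name) {
  # Upper-case the first letter of each word (stand-in for str_to_title())
  titled <- gsub("\\b([a-z])", "\\U\\1", name, perl = TRUE)
  # Move the first word behind the remaining words, separated by a comma
  sub("^(\\S+)\\s+(.+)$", "\\2, \\1", titled, perl = TRUE)
}

to_last_first(c("brian roberts", "stephen curry"))
# "Roberts, Brian" "Curry, Stephen"

# A made-up three-word name, to show the difference described above
to_last_first("first middle last")
# "Middle Last, First"
```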

Visualizing the Shot Map Based on Shot Clock

In order to analyze the shots taken and the time left on the shot clock for Stephen Curry and Draymond Green, we use plotly so the reader can hover over each data point and see the exact shot clock and shot distance at which the shot was taken. We first filter the data set to the Golden State Warriors and then to Curry and Green. We also color the points by player so it is easier to see who took each shot.

shots2<-shots %>% 
  filter(TEAM == "GSW") %>%
  filter(player_name %in% c("Curry, Stephen", "Green, Draymond"))

library(plotly)
plot_ly(shots2, x = ~SHOT_CLOCK, y = ~SHOT_DIST) %>%
  layout(title = "Curry vs. Green Shot Selection (2014-15)") %>%
  # Color by player; shots with a missing SHOT_CLOCK are dropped with a warning
  add_markers(color = ~player_name,
              text = ~paste0("Player: ", player_name))
## Warning: Ignoring 34 observations
## Warning in RColorBrewer::brewer.pal(N, "Set2"): minimal value for n is 3, returning requested palette with 3 different levels

Several features of the visualization stand out. First, Stephen Curry’s shots are numerous and spread across nearly every distance, which suggests there is almost no kind of shot he is afraid to take. By comparison, Draymond Green, a more defense-oriented player, took far fewer mid-range shots and shot mostly from three and in the paint. An interesting cluster of points at the bottom right of the plot represents shots Green took very close to the basket early in the shot clock. This might suggest that Green anticipates turnovers better than Curry and is therefore readily available to receive outlet passes and shoot in the paint at the start of a possession.

Visualizing the Shot Frequency Based on Distance from the Basket

The plot below is a histogram of the frequency of shots taken by distance from the basket. We create a faceted plot for six players (Stephen Curry, LeBron James, James Harden, Kyrie Irving, DeMarcus Cousins, and Anthony Davis), which lets us compare how these players distribute their shots across distances.

library(ggthemes)  # provides theme_clean() and theme_economist()
shots %>%
  filter(player_name %in% c("Curry, Stephen", "James, Lebron", "Harden, James", "Irving, Kyrie", "Davis, Anthony", "Cousins, Demarcus")) %>%
  ggplot() + 
  geom_histogram(aes(x = SHOT_DIST, fill = player_name)) +
  labs(title = "Shot Usage by Distance Among Top NBA Players (2014-15)", 
       x = "Shot Distance (Feet)", 
       y = "Shot Usage") + 
  facet_wrap(~ player_name) + 
  theme_clean() + 
  theme(legend.position = "none")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Based on the figure above, we see that the guards tend to shoot from three more often than the forwards and centers. This is not surprising, as guards have different skill sets and responsibilities. What is interesting is that none of these players shoots from mid-range as often as from the other areas, which reflects how the game has changed in recent years. In addition, James Harden took more threes than Stephen Curry in this data set, even though Curry is the player most associated with the 3-point shot.
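
The mid-range observation can also be checked numerically instead of being read off the histogram, by labeling each shot with a zone and computing the share of attempts per zone. The sketch below does this for the first ten SHOT_DIST and PTS_TYPE values shown in the str() output (ten shots by one player, so the shares themselves mean little). The 8-foot cutoff between "paint" and "mid-range" is an assumption made for this sketch and is not defined anywhere in the data set; the variable names are illustrative.

```r
# First ten SHOT_DIST / PTS_TYPE values from the str(shots) output
shot_dist <- c(7.7, 28.2, 10.1, 17.2, 3.7, 18.4, 20.7, 3.5, 24.6, 22.4)
pts_type  <- c(2, 3, 2, 2, 2, 2, 2, 2, 3, 3)

# Assumed cutoff (not from the data set): 2-pointers under 8 ft count as "paint"
paint_cutoff_ft <- 8

# 3-pointers are identified by PTS_TYPE; 2-pointers are split by distance
zone <- ifelse(pts_type == 3, "three",
               ifelse(shot_dist < paint_cutoff_ft, "paint", "mid-range"))

# Share of attempts per zone
zone_share <- prop.table(table(zone))
zone_share
```

With the full data, creating the same zone column with mutate() and counting by player_name and zone would give per-player shares that could be compared directly.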

Exploring the 2-pointer vs. 3-pointer Shots

We want to compare 2-point and 3-point shots. What percentage of shots attempted in the 2014-2015 season were 2-pointers versus 3-pointers? Additionally, which type of shot is the most successful?

2-pointers vs. 3-pointers Attempted in the 2014-15 season

To answer the first question, we count the shots of each type and add a column, prop, that holds each type’s proportion of all shots attempted. We then convert PTS_TYPE to a factor so that it is treated as discrete rather than continuous, and draw a pie chart of the proportions.

shotcount <- shots %>%
  group_by(PTS_TYPE) %>%
  summarize(countmm = n()) %>%
  mutate(prop = countmm/sum(countmm))
shotcount
## # A tibble: 2 x 3
##   PTS_TYPE countmm  prop
##      <dbl>   <int> <dbl>
## 1        2   94173 0.735
## 2        3   33896 0.265
shotcount$PTS_TYPE<-factor(shotcount$PTS_TYPE)

ggplot(shotcount, aes(x="", y=prop, fill=PTS_TYPE)) +
  geom_bar(stat="identity", width=1) +
  ggtitle("Percentage of 2 point and 3 point Shots Attempted (2014-15)") +
  coord_polar("y", start=0)+theme_void() +
  scale_fill_brewer(palette="Set1")  

Based on the figure, 73.5% of shots attempted in the 2014-2015 NBA season were 2-pointers and 26.5% were 3-pointers. To explore this further, we look at what percentage of the shots that were actually made were 2-pointers and 3-pointers.

2-pointers vs. 3-pointers Made in the 2014-15 season

Because we are focusing only on shots that were made, we first filter the data set to made shots. The rest of the process is similar to the previous question, except that we draw a bar graph comparing the two shot types instead of a pie chart.

shotresult <- shots %>%
  filter(SHOT_RESULT=="made") %>%
  group_by(  SHOT_RESULT, PTS_TYPE) %>%
  summarize(countmm = n()) %>%
  mutate(prop = countmm/sum(countmm))
## `summarise()` has grouped output by 'SHOT_RESULT'. You can override using the `.groups` argument.
shotresult1 <- shotresult %>%
mutate(prop1=prop*100)
shotresult1
## # A tibble: 2 x 5
## # Groups:   SHOT_RESULT [1]
##   SHOT_RESULT PTS_TYPE countmm  prop prop1
##   <chr>          <dbl>   <int> <dbl> <dbl>
## 1 made               2   45990 0.794  79.4
## 2 made               3   11915 0.206  20.6
shotresult1$PTS_TYPE<-factor(shotresult1$PTS_TYPE)


ggplot(shotresult1, aes(x=PTS_TYPE, y=prop1,  fill =PTS_TYPE)) + 
  geom_bar(stat="identity" )+ ggtitle("Percentage of 2 point and 3 point Shots Made (2014-15)") +
  xlab("Type of Points") + ylab("Percentage")+ theme_economist()+
  theme(legend.title=element_blank())

Based on the figure, we found that 79.4% of all made shots were 2-pointers while 20.6% were 3-pointers throughout the season, which shows that despite the rise in the 3-point shot, the reliance on the 2-point shot is still strong.
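
Note that a shot type's share of all made shots is not the same as its success rate. Dividing the made counts by the attempted counts from the two tables above gives a rough sketch of the make rate and the average points per attempt for each shot type. The data frame below simply re-enters those four counts by hand; the names by_type, fg_pct, and pts_per_attempt are illustrative.

```r
# Counts copied from the shotcount and shotresult1 tables above
by_type <- data.frame(
  PTS_TYPE  = c(2, 3),
  attempted = c(94173, 33896),
  made      = c(45990, 11915)
)

# Make rate: made shots divided by attempted shots, per shot type
by_type$fg_pct <- by_type$made / by_type$attempted

# Average points earned per attempt: shot value times make rate
by_type$pts_per_attempt <- by_type$PTS_TYPE * by_type$fg_pct

round(by_type[, c("PTS_TYPE", "fg_pct", "pts_per_attempt")], 3)
```

Under these counts, 2-pointers were made about 48.8% of the time versus about 35.2% for 3-pointers, yet the 3-pointer returned slightly more points per attempt (about 1.05 versus 0.98).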